Movie Rating Model and Predictor
Part 1: Data
The data comprised audience and critics' opinions, awards, studio, and actor information from Rotten Tomatoes, IMDB, and BoxOfficeMojo.com for a random sample of 651 movies produced and released prior to 2016.
Data Sources
Rotten Tomatoes
Launched in August 1998 by Senh Duong, Rotten Tomatoes is an American review aggregation website for film and television.
IMDB
Generalizability
Selected Features
The full codebook for the data set can be found in Appendix A. Table 1 lists the variables selected from the raw data for inclusion at this stage of the study.

Table 1: Selected features

| Type | Variable | Description |
|---|---|---|
| Overview | ||
| Categorical | title_type | Type of movie (Documentary, Feature Film, TV Movie) |
| Categorical | genre | Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other) |
| Categorical | mpaa_rating | MPAA rating of the movie (G, PG, PG-13, R, Unrated) |
| Organization | ||
| Categorical | studio | Studio that produced the movie |
| Categorical | director | Director of the movie |
| Dates | ||
| Categorical | thtr_rel_month | Month the movie is released in theaters |
| Performance | ||
| Numeric | imdb_num_votes | Number of votes on IMDB |
| Numeric | imdb_rating | Rating on IMDB |
| Numeric | critics_score | Critics score on Rotten Tomatoes |
| Numeric | audience_score | Audience score on Rotten Tomatoes |
| Categorical | best_pic_nom | Whether or not the movie was nominated for a best picture Oscar (no, yes) |
| Categorical | best_pic_win | Whether or not the movie won a best picture Oscar (no, yes) |
| Categorical | best_actor_win | Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie |
| Categorical | best_actress_win | Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actress won an Oscar for her role in the given movie |
| Categorical | best_dir_win | Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie |
| Categorical | top200_box | Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes) |
| Box Office | ||
| Numeric | box_office | Box office revenue from BoxOfficeMojo.com |
| Numeric | box_office_log | Log of Box office revenue from BoxOfficeMojo.com |
Note that the values for the box office variables were obtained for a random sample of 100 reviews from the BoxOfficeMojo.com website.
Derived Features
The additional features listed in Table 2 were derived from the selected features.

Table 2: Derived features

| Type | Variable | Description |
|---|---|---|
| Dates | ||
| Categorical | thtr_rel_season | Season the movie was released in theaters |
| Experience | ||
| Numeric | studio_experience | Total number of films for studio in sample |
| Numeric | director_experience | Total number of films in sample for a director |
| Numeric | cast_experience | The sum across all cast members for a film, of the number of films in which each actor appeared |
| Performance | ||
| Numeric | imdb_num_votes_log | Log number of IMDB votes |
| Numeric | studio_votes | Total IMDB votes for the studio for each film |
| Numeric | studio_votes_log | Log total IMDB votes for the studio for each film |
| Numeric | cast_votes | Total number of allocated IMDB votes for the cast of a film |
| Numeric | cast_votes_log | Log of the total number of allocated IMDB votes for the cast of a film |
| Interaction | ||
| Numeric | scores | 10 * IMDB Rating + critics score + audience_score |
| Numeric | scores_log | Log(10 * IMDB Rating + critics score + audience_score) |
| Numeric | votes_imdb_rating | Votes * imdb_rating |
| Numeric | votes_imdb_rating_log | log(votes * imdb_rating) |
| Numeric | votes_critics_score | votes * critics_score |
| Numeric | votes_critics_score_log | log(votes * critics_score) |
| Numeric | votes_audience_score | votes * audience_score |
| Numeric | votes_audience_score_log | log(votes * audience_score) |
| Numeric | votes_scores | votes * scores |
| Numeric | votes_scores_log | log(votes * scores) |
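The interaction features in Table 2 can be sketched for a single film as follows. This is a minimal illustration, not the original code: the function name is hypothetical, "votes" is assumed to mean `imdb_num_votes`, and base-2 logarithms are assumed because the summary statistics reported later are consistent with them (for example, log2(183) ≈ 7.5 matches the minimum of the log IMDB votes).

```python
import math


def derive_interaction_features(imdb_rating, critics_score, audience_score, imdb_num_votes):
    """Return the Table 2 interaction features for one film as a dict."""
    # Multiplying the 0-10 IMDB rating by 10 puts it on the same 0-100 scale
    # as the two Rotten Tomatoes scores before the three are summed.
    scores = 10 * imdb_rating + critics_score + audience_score
    features = {
        "scores": scores,
        "votes_imdb_rating": imdb_num_votes * imdb_rating,
        "votes_critics_score": imdb_num_votes * critics_score,
        "votes_audience_score": imdb_num_votes * audience_score,
        "votes_scores": imdb_num_votes * scores,
    }
    # Assumption: base-2 logs. All inputs must be strictly positive.
    logs = {name + "_log": math.log2(value) for name, value in features.items()}
    return {**features, **logs}


# Hypothetical film used only to exercise the function.
example = derive_interaction_features(
    imdb_rating=7.5, critics_score=80, audience_score=70, imdb_num_votes=1024
)
```

Keeping the raw and logged versions side by side mirrors the table, where every product has a `_log` counterpart.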
Cast experience and cast votes for each film were computed as follows:
Cast Experience
Cast experience for each film was defined by:
\[e = \displaystyle\sum_{i=1}^{5} N_i\] where:
\(e\) is the total cast experience for the film
\(N_i\) is the total number of films in which actor \(i\) was involved
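As a rough sketch of the cast experience sum defined above, assuming that \(N_i\) is counted within the sample and that each film lists five cast members (the film and actor names are invented for illustration):

```python
from collections import Counter

# Hypothetical toy sample: each film lists its five credited cast members.
films = {
    "Film A": ["Ann", "Bob", "Cy", "Di", "Ed"],
    "Film B": ["Ann", "Bob", "Flo", "Gus", "Hal"],
    "Film C": ["Ann", "Ida", "Jo", "Kit", "Lee"],
}

# N_i: number of films in the sample in which actor i appears.
appearances = Counter(actor for cast in films.values() for actor in cast)

# e: the sum of N_i over the five cast members of each film.
cast_experience = {
    title: sum(appearances[actor] for actor in cast)
    for title, cast in films.items()
}
```

Under this reading, a film whose five cast members appear nowhere else in the sample has a cast experience of 5, the smallest possible value.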
Cast Votes
Cast votes for each film were defined by:
\[v = \displaystyle\sum_{i=1}^{5} V_i\] where:
\(v\) is the sum of IMDB cast votes for the film
\(V_i\) is the sum of allocated IMDB cast votes for actor \(i\)
IMDB votes were allocated to cast members as follows:
* 40% of total film IMDB votes for actor1
* 30% of total film IMDB votes for actor2
* 15% of total film IMDB votes for actor3
* 10% of total film IMDB votes for actor4
* 5% of total film IMDB votes for actor5
Each actor was allocated votes accordingly, then the allocated votes were aggregated across every film in which the cast member appeared. The IMDB votes were counted without regard for date to compensate for the limitations imposed by the sample size, since moviegoers had access to the population of reviews and to director, studio, and actor performance data when making their purchase decisions.
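The allocation and aggregation described above might look like the following sketch. It assumes the shares are applied by billing position (actor1 to actor5), that \(V_i\) accumulates an actor's allocated votes over every film in the sample, and that the film-level value sums \(V_i\) over its cast; the films and vote counts are invented.

```python
from collections import defaultdict

# Share of a film's IMDB votes allocated to actor1 .. actor5.
SHARES = (0.40, 0.30, 0.15, 0.10, 0.05)

# Hypothetical toy sample.
films = [
    {"title": "Film A", "votes": 1000, "cast": ["Ann", "Bob", "Cy", "Di", "Ed"]},
    {"title": "Film B", "votes": 2000, "cast": ["Bob", "Ann", "Flo", "Gus", "Hal"]},
]

# V_i: allocated votes for actor i, accumulated over all films in the sample.
actor_votes = defaultdict(float)
for film in films:
    for actor, share in zip(film["cast"], SHARES):
        actor_votes[actor] += share * film["votes"]

# v: the sum of V_i over the cast of each film.
cast_votes = {
    film["title"]: sum(actor_votes[actor] for actor in film["cast"])
    for film in films
}
```

Because the shares sum to 100%, a film whose cast appears in no other film receives cast votes equal to its own IMDB vote count.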
Omitted Features
The features listed in Table 3 were omitted because they were redundant or not directly relevant to the research question. Some variables, such as the actor variables, were used to derive other variables, as described under Derived Features.
Table 3: Omitted features

| Variable | Description |
|---|---|
| title | Title of movie |
| runtime | Runtime of movie (in minutes) |
| imdb_url | Link to IMDB page for the movie |
| rt_url | Link to Rotten Tomatoes page for the movie |
| actor1 | First main actor/actress in the abridged cast of the movie |
| actor2 | Second main actor/actress in the abridged cast of the movie |
| actor3 | Third main actor/actress in the abridged cast of the movie |
| actor4 | Fourth main actor/actress in the abridged cast of the movie |
| actor5 | Fifth main actor/actress in the abridged cast of the movie |
| thtr_rel_year | Year the movie is released in theaters |
| thtr_rel_day | Day of the month the movie is released in theaters |
| dvd_rel_year | Year the movie is released on DVD |
| dvd_rel_month | Month the movie is released on DVD |
| dvd_rel_day | Day of the month the movie is released on DVD |
| critics_rating | Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten) |
| audience_rating | Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright) |
Data Cleaning
The variables of interest were obtained from the data and complete cases were extracted reducing the number of observations from 651 to 627.
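A minimal sketch of complete-case extraction, assuming rows are held as dictionaries and that missing values appear as `None`, an empty string, or the string "NA" (the helper name and the toy rows are hypothetical):

```python
def complete_cases(rows, columns):
    """Keep only the rows that have a non-missing value in every listed column."""

    def is_missing(value):
        return value is None or value == "" or value == "NA"

    return [
        row for row in rows
        if not any(is_missing(row.get(column)) for column in columns)
    ]


# Hypothetical rows; only the first has all three variables of interest.
rows = [
    {"title_type": "Feature Film", "studio": "Studio X", "imdb_rating": 7.1},
    {"title_type": "Feature Film", "studio": None, "imdb_rating": 6.4},
    {"title_type": "Documentary", "studio": "Studio Y", "imdb_rating": "NA"},
]
kept = complete_cases(rows, ["title_type", "studio", "imdb_rating"])
```

Restricting the check to the variables of interest matters: a row missing only an omitted variable would still be retained.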
Part 2: Research question
The underlying intent of this analysis was to determine the factors that most influence box office success for a film. Since box office revenue was not among the variables included in the raw data set, the first task was to determine which of the selected (or derived) variables would stand as a proxy for box office success. As such the first research question is concretely stated as follows:
> Which of the selected or derived variables is most highly associated / correlated with total lifetime box office revenue?
Once this proxy response variable was determined, the features most highly associated / correlated with it were examined via the following research question.
> Which features are most highly associated / correlated with the proxy response for box office success?
Part 3: Exploratory data analysis
The exploratory data analysis began with a data preprocessing step to extract complete cases and to create the response and two additional explanatory variables. Next, a univariate analysis examined the distribution of each variable individually. Lastly, a bivariate analysis explored the relationships between the response variable and various candidate predictors.
Univariate Analysis
Univariate Analysis of Categorical Variables
The purpose of the univariate analysis of categorical variables was to examine the frequencies and proportions of observations for each level of each categorical variable. Levels with fewer than five observations were removed from further analysis.
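The frequency tabulation and the five-observation rule could be sketched as below. This assumes that "removed" means dropping the observations belonging to a rare level; the helper names and toy rows are hypothetical.

```python
from collections import Counter


def level_summary(values):
    """Return {level: (count, proportion)} for one categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {level: (n, n / total) for level, n in counts.items()}


def drop_rare_levels(rows, column, min_count=5):
    """Drop rows whose level of `column` has fewer than `min_count` observations."""
    counts = Counter(row[column] for row in rows)
    return [row for row in rows if counts[row[column]] >= min_count]


# Hypothetical toy data: the NC-17 level has only two observations.
rows = (
    [{"mpaa_rating": "R"}] * 6
    + [{"mpaa_rating": "PG"}] * 5
    + [{"mpaa_rating": "NC-17"}] * 2
)
filtered = drop_rare_levels(rows, "mpaa_rating")
summary = level_summary(row["mpaa_rating"] for row in filtered)
```

Proportions are computed after filtering, so they sum to one over the retained levels.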
The categorical variables included at this stage of the analysis are indicated in Table 4.
Table 4: Categorical Variables

| Variable | Description |
|---|---|
| title_type | Type of movie (Documentary, Feature Film, TV Movie) |
| genre | Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other) |
| mpaa_rating | MPAA rating of the movie (G, PG, PG-13, R, Unrated) |
| studio | Studio that produced the movie |
| director | Director of the movie |
| thtr_rel_season | Season the movie was released in theaters |
| thtr_rel_month | Month the movie is released in theaters |
| best_pic_nom | Whether or not the movie was nominated for a best picture Oscar (no, yes) |
| best_pic_win | Whether or not the movie won a best picture Oscar (no, yes) |
| best_actor_win | Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie |
| best_actress_win | Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actress won an Oscar for her role in the given movie |
| best_dir_win | Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie |
| top200_box | Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes) |
Title Type
Feature films constituted 92% of the films in the sample. Since the focus of this study was theatrical releases, TV movies, which were included in the raw data, were excluded from this analysis.
Figure 1: Films by title type
Genre
The drama genre represented a plurality of the releases in the sample, followed by comedy, action & adventure, and mystery & suspense. The top four genres account for nearly 80% of the films in the sample.
Figure 2: Films by genre
MPAA Rating
R-rated films accounted for the largest share of the releases, followed by PG and PG-13. Collectively, R, PG, and PG-13 rated films represent 90% of the films in the sample. NC-17 films were excluded from this analysis.
Figure 3: Films by MPAA Rating
Studio
The data included films from 202 studios. Data with respect to the number of films in the sample per studio are captured in the studio experience variable below.
Director
The work of 511 directors was included in the sample provided for this project. Data with respect to the number of films in the sample per director are captured in the director experience variable below.
Season of Theatrical Release
A plurality of features in the sample were released during the fall and summer months, with over 20% opening in the month of December alone.
Figure 4: Theatrical releases by season
Month of Theatrical Release
The plurality of features in the sample (31%) were released during the months of January, June, October and December.
Figure 5: Theatrical releases by month
Best Picture
Since the proportions of films nominated for and winning best picture were so small, these variables were not likely to be good predictors of movie popularity. The bivariate analysis below will illuminate this further.
Figure 6: Best picture nominations and wins
Best Director / Actor / Actress
As indicated in Figure 7, the percentages of films with best director, actor, and actress Oscars were 7%, 15%, and 11%, respectively. Again, these proportions indicate that Oscar awards would not be a good predictor of movie popularity. The bivariate analysis will explore this further.
Figure 7: Best director/actor/actress
Top 200 Box Office
Again, the proportion of films in the Top 200 Box Office list was minuscule, indicating that inclusion in the list was not likely to be a good predictor of movie popularity.
Figure 8: Frequency and proportion of movies by top 200 box office earnings
Univariate Analysis of Quantitative Variables
The primary aim of this analysis was to examine the distribution of each variable vis-a-vis a normal distribution and to identify potential outliers. Summary statistics, histograms, boxplots, and normal quantile-quantile plots were rendered for each variable. The quantitative variables included at this stage of the analysis are indicated in Table 5.
Table 5: Quantitative Variables

| Variable | Description |
|---|---|
| studio_experience | Total number of films for studio in sample |
| director_experience | Total number of films in sample for a director |
| cast_experience | The sum across all cast members for a film, of the number of films in which each actor appeared |
| imdb_num_votes | Number of votes on IMDB |
| imdb_num_votes_log | Log number of IMDB votes |
| imdb_rating | Rating on IMDB |
| critics_score | Critics score on Rotten Tomatoes |
| audience_score | Audience score on Rotten Tomatoes |
| studio_votes | Total IMDB votes for the studio for each film |
| studio_votes_log | Log total IMDB votes for the studio for each film |
| cast_votes | Total number of allocated IMDB votes for the cast of a film |
| cast_votes_log | Log of the total number of allocated IMDB votes for the cast of a film |
| scores | 10 * IMDB Rating + critics score + audience_score |
| scores_log | Log(10 * IMDB Rating + critics score + audience_score) |
| votes_imdb_rating | Votes * imdb_rating |
| votes_imdb_rating_log | log(votes * imdb_rating) |
| votes_critics_score | votes * critics_score |
| votes_critics_score_log | log(votes * critics_score) |
| votes_audience_score | votes * audience_score |
| votes_audience_score_log | log(votes * audience_score) |
| votes_scores | votes * scores |
| votes_scores_log | log(votes * scores) |
| box_office | Box office revenue from BoxOfficeMojo.com |
| box_office_log | Log of Box office revenue from BoxOfficeMojo.com |
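The summary statistics reported for each quantitative variable can be approximated with the sketch below. The estimators are assumptions rather than the ones actually used: quartiles use the inclusive (linear interpolation) method, SD is the sample standard deviation, CV is 100 × SD / mean, and skewness and kurtosis are simple moment-based estimates, with kurtosis reported as excess kurtosis so that a normal distribution scores 0.

```python
import statistics


def summarize(values):
    """Return summary statistics in the layout of the per-variable tables."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

    # Central moments with an n denominator, used for the shape statistics.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n

    return {
        "N": n,
        "Min": min(values),
        "Q1": q1,
        "Median": median,
        "Mean": mean,
        "Q3": q3,
        "IQR": q3 - q1,
        "Max": max(values),
        "SD": sd,
        "CV": 100 * sd / mean,
        "Kurtosis": m4 / m2 ** 2 - 3,  # excess kurtosis: negative means light tails
        "Skewness": m3 / m2 ** 1.5,    # positive means right-skewed
    }


# Hypothetical symmetric data used only to exercise the function.
stats = summarize(list(range(1, 10)))
```

Different software packages use slightly different quantile and shape estimators, so values computed this way may differ in the last digit from those in the tables.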
Studio Experience
This derived variable measured the relative experience of a given studio and was defined as the sum of the observations for the studio associated with each film.
Table 6: Studio experience summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 1 | 2 | 7 | 11.2 | 18 | 16 | 37 | 0 | 10.72 | 95.5 | -0.04 | 1.02 |
Figure 9: Studio experience histogram and QQ Plot
Figure 10: Studio experience boxplot
Central Tendency: The summary statistics (Table 6) show that the median and mean studio experience were 7 films and 11.2 films, respectively.
Dispersion: The standard deviation, s = 10.72, corresponds with a coefficient of variation of 95.5%, indicating a very high degree of dispersion.
Shape of Distribution: The sample skewness (1.02), indicated that the distribution of studio experience was right-skewed. The sample kurtosis (-0.04), indicated that the distribution of studio experience was platykurtic or light-tailed. The histogram and QQ plot in Figure 9 reveal a distribution which departs significantly from normality.
Outliers: The boxplot in Figure 10, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that no outliers were extant. The 25%, 75%, and IQR were 2, 18, and 16, respectively. This yielded a 1.5xIQR ‘acceptable’ range [0, 42]. Indeed, this confirmed the existence of no outliers.
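The 1.5xIQR range used here and in the sections that follow can be reproduced with a short sketch. Flooring the lower fence at zero is an assumption inferred from the reported range: 2 - 1.5 × 16 is -22, yet the range is given as [0, 42]. The function names are hypothetical.

```python
def iqr_fences(q1, q3, lower_floor=0.0):
    """Return the (lower, upper) 1.5 x IQR fences, with the lower fence floored."""
    iqr = q3 - q1
    lower = max(q1 - 1.5 * iqr, lower_floor)  # assumption: negative fences clip to 0
    upper = q3 + 1.5 * iqr
    return lower, upper


def find_outliers(values, q1, q3, lower_floor=0.0):
    """Return the values that fall outside the 1.5 x IQR fences."""
    lower, upper = iqr_fences(q1, q3, lower_floor)
    return [x for x in values if x < lower or x > upper]


# Studio experience quartiles from Table 6: Q1 = 2, Q3 = 18.
studio_fences = iqr_fences(2, 18)  # (0.0, 42.0)

# Table 6 reports Min = 1, Median = 7, Max = 37, all inside the fences.
studio_outliers = find_outliers([1, 7, 37], 2, 18)
```

Since the reported maximum of 37 is below the upper fence of 42, no observation can be flagged, which agrees with the conclusion above.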
Director Experience
This derived variable measured the relative experience of a given director and was defined as the sum of the observations for the director associated with each film.
Table 7: Director experience summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 1 | 1 | 1 | 1.5 | 2 | 1 | 4 | 0 | 0.75 | 51.2 | 2.57 | 1.73 |
Figure 11: Director experience histogram and QQ Plot
Figure 12: Director experience boxplot
Central Tendency: The summary statistics (Table 7) show that the median and mean director experience were 1 film and 1.5 films, respectively.
Dispersion: The standard deviation, s = 0.75, corresponds with a coefficient of variation of 51.2%, indicating a high degree of dispersion.
Shape of Distribution: The sample skewness (1.73), indicated that the distribution of director experience was right-skewed. The sample kurtosis (2.57), indicated that the distribution of director experience was leptokurtic or heavy-tailed. The histogram and QQ plot in Figure 11 reveal a distribution which departs significantly from normality.
Outliers: The boxplot in Figure 12, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that outliers were extant. The 25%, 75%, and IQR were 1, 2, and 1, respectively. This yielded a 1.5xIQR ‘acceptable’ range [0, 3.5]. Indeed, this confirmed the existence of 20 outliers. Given the proximity of the outliers to the 1.5xIQR, no effort was made to remove them.
Cast Experience
This derived variable measured the relative experience of a given cast and was defined as the sum of the observations for the cast associated with each film.
Table 8: Cast experience summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 5 | 6 | 7 | 7.5 | 9 | 3 | 15 | 0 | 2.16 | 29 | 0.33 | 0.89 |
Figure 13: Cast experience histogram and QQ Plot
Figure 14: Cast experience boxplot
Central Tendency: The summary statistics (Table 8) show that the median and mean cast experience were 7 films and 7.5 films, respectively.
Dispersion: The standard deviation, s = 2.16, corresponds with a coefficient of variation of 29%, indicating a moderate degree of dispersion.
Shape of Distribution: The sample skewness (0.89), indicated that the distribution of cast experience was right-skewed. The sample kurtosis (0.33), indicated that the distribution of cast experience was leptokurtic or heavy-tailed. The histogram and QQ plot in Figure 13 reveal a distribution which departs significantly from normality.
Outliers: The boxplot in Figure 14, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that outliers were extant. The 25%, 75%, and IQR were 6, 9, and 3, respectively. This yielded a 1.5xIQR ‘acceptable’ range [1.5, 13.5]. Indeed, this confirmed the existence of 7 outliers. Given the proximity of the outliers to the 1.5xIQR, no effort was made to remove them.
Number of IMDB Votes
This variable captured the number of IMDB votes cast for each film.
Table 9: IMDB votes summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 183 | 4986 | 15806 | 59291.3 | 60409 | 55423 | 893008 | 0 | 113763.6 | 191.9 | 19.43 | 3.99 |
Figure 15: IMDB votes histogram and QQ Plot
Figure 16: IMDB votes boxplot
Central Tendency: The summary statistics (Table 9) show that the median and mean number of IMDB votes were 15,806 votes and 59,291.3 votes, respectively.
Dispersion: The standard deviation, s = 113,763.58, corresponds with a coefficient of variation of 191.9%, indicating a very high degree of dispersion.
Shape of Distribution: The sample skewness (3.99), indicated that the distribution of imdb votes was right-skewed. The sample kurtosis (19.43), indicated that the distribution of imdb votes was leptokurtic or heavy-tailed. The histogram and QQ plot in Figure 15 reveal a distribution which departs significantly from normality.
Outliers: The boxplot in Figure 16, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that outliers were extant. The 25%, 75%, and IQR were 4,986, 60,409, and 55,423, respectively. This yielded a 1.5xIQR ‘acceptable’ range [0, 143,543.5]. Indeed, this confirmed the existence of 68 outliers.
Log Number of IMDB Votes
This was a log transformation of the IMDB votes variable.
Table 10: Log IMDB votes summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 7.5 | 12.3 | 13.9 | 14.1 | 15.9 | 3.6 | 19.8 | 0 | 2.37 | 16.8 | -0.57 | 0.04 |
Figure 17: Log IMDB votes histogram and QQ Plot
Figure 18: Log IMDB votes boxplot
Central Tendency: The summary statistics (Table 10) show that the median and mean log IMDB votes were 13.9 log(votes) and 14.1 log(votes), respectively.
Dispersion: The standard deviation, s = 2.37, corresponds with a coefficient of variation of 16.8%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (0.04), indicated that the distribution of imdb log votes was approximately symmetric. The sample kurtosis (-0.57), indicated that the distribution of imdb log votes was platykurtic or light-tailed. The histogram and QQ plot in Figure 17 reveal a nearly normal distribution.
Outliers: The boxplot in Figure 18, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that no outliers were extant. The 25%, 75%, and IQR were 12.3, 15.9, and 3.6, respectively. This yielded a 1.5xIQR ‘acceptable’ range [6.9, 21.3]. Indeed, this confirmed the existence of no outliers.
IMDB Ratings
This variable captured the IMDB rating for each film
Table 11: IMDB rating summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 1.9 | 5.9 | 6.6 | 6.5 | 7.3 | 1.4 | 9 | 0 | 1.09 | 16.8 | 1.27 | -0.89 |
Figure 19: IMDB rating histogram and QQ Plot
Figure 20: IMDB rating boxplot
Central Tendency: The summary statistics (Table 11) show that the median and mean IMDB rating were 6.6 points and 6.5 points, respectively.
Dispersion: The standard deviation, s = 1.09, corresponds with a coefficient of variation of 16.8%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (-0.89), indicated that the distribution of imdb rating was left-skewed. The sample kurtosis (1.27), indicated that the distribution of imdb rating was leptokurtic or heavy-tailed. The histogram and QQ plot in Figure 19 reveal a nearly normal distribution.
Outliers: The boxplot in Figure 20, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that outliers were extant. The 25%, 75%, and IQR were 5.9, 7.3, and 1.4, respectively. This yielded a 1.5xIQR ‘acceptable’ range [3.8, 9.4]. Indeed, this confirmed the existence of 19 outliers.
Critics Scores
This variable captured the critics scores for each film
Table 12: Critics score summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 1 | 33 | 61 | 57.2 | 82 | 49 | 100 | 0 | 28.38 | 49.6 | -1.18 | -0.25 |
Figure 21: Critics score histogram and QQ Plot
Figure 22: Critics score boxplot
Central Tendency: The summary statistics (Table 12) show that the median and mean critics score were 61 points and 57.2 points, respectively.
Dispersion: The standard deviation, s = 28.38, corresponds with a coefficient of variation of 49.6%, indicating a moderate degree of dispersion.
Shape of Distribution: The sample skewness (-0.25), indicated that the distribution of critics score was approximately symmetric. The sample kurtosis (-1.18), indicated that the distribution of critics score was platykurtic or light-tailed. The histogram and QQ plot in Figure 21 reveals a left skewed distribution that departs from normality.
Outliers: The boxplot in Figure 22, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that no outliers were extant. The 25%, 75%, and IQR were 33, 82, and 49, respectively. This yielded a 1.5xIQR ‘acceptable’ range [0, 155.5]. Indeed, this confirmed the existence of no outliers.
Audience Scores
This variable captured the audience scores for each film
Table 13: Audience score summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 1 | 33 | 61 | 57.2 | 82 | 49 | 100 | 0 | 28.38 | 49.6 | -1.18 | -0.25 |
Figure 23: Audience score histogram and QQ Plot
Figure 24: Audience score boxplot
Central Tendency: The summary statistics (Table 13) show that the median and mean audience score were 61 points and 57.2 points, respectively.
Dispersion: The standard deviation, s = 28.38, corresponds with a coefficient of variation of 49.6%, indicating a moderate degree of dispersion.
Shape of Distribution: The sample skewness (-0.25), indicated that the distribution of audience score was approximately symmetric. The sample kurtosis (-1.18), indicated that the distribution of audience score was platykurtic or light-tailed. The histogram and QQ plot in Figure 23 reveals a left skewed distribution that departs from normality.
Outliers: The boxplot in Figure 24, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that no outliers were extant. The 25%, 75%, and IQR were 33, 82, and 49, respectively. This yielded a 1.5xIQR ‘acceptable’ range [0, 155.5]. Indeed, this confirmed the existence of no outliers.
Studio Votes
This variable captured the studio votes for each film
Table 14: Studio votes summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 183 | 50390.5 | 250669 | 903089.8 | 790665 | 740274.5 | 4404677 | 0 | 1347877 | 149.3 | 1.3 | 1.64 |
Figure 25: Studio votes histogram and QQ Plot
Figure 26: Studio votes boxplot
Central Tendency: The summary statistics (Table 14) show that the median and mean studio votes were 250,669 votes and 903,089.8 votes, respectively.
Dispersion: The standard deviation, s = 1,347,877.47, corresponds with a coefficient of variation of 149.3%, indicating a very high degree of dispersion.
Shape of Distribution: The sample skewness (1.64) indicated that the distribution of studio votes was right-skewed. The sample kurtosis (1.3) indicated that the distribution of studio votes was leptokurtic or heavy-tailed. The histogram and QQ plot in Figure 25 reveal a right-skewed distribution that departs significantly from normality.
Outliers: The boxplot in Figure 26, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that outliers were extant. The 25%, 75%, and IQR were 50,390.5, 790,665, and 740,274.5, respectively. This yielded a 1.5xIQR ‘acceptable’ range [0, 1,901,076.75]. Indeed, this confirmed the existence of 126 outliers.
Log Studio Votes
This is a log transformation of the studio votes variable.
Table 15: Log studio votes summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 7.5 | 15.6 | 17.9 | 17.5 | 19.6 | 4 | 22.1 | 0 | 3.23 | 18.5 | -0.53 | -0.5 |
Figure 27: Log studio votes histogram and QQ Plot
Figure 28: Log studio votes boxplot
Central Tendency: The summary statistics (Table 15) show that the median and mean log studio votes were 17.9 log(votes) and 17.5 log(votes), respectively.
Dispersion: The standard deviation, s = 3.23, corresponds with a coefficient of variation of 18.5%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (-0.5), indicated that the distribution of log studio votes was approximately symmetric. The sample kurtosis (-0.53), indicated that the distribution of log studio votes was platykurtic or light-tailed. The histogram and QQ plot in Figure 27 reveals a left skewed distribution that approximates normality.
Outliers: The boxplot in Figure 28, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that outliers were extant. The 25%, 75%, and IQR were 15.6, 19.6, and 4, respectively. This yielded a 1.5xIQR ‘acceptable’ range [9.6, 25.6]. Indeed, this confirmed the existence of 3 outliers.
Cast Votes
This variable captured the total number of votes allocated to each cast member for a film.
Table 16: Cast votes summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 183 | 17866 | 76506.8 | 156740.5 | 226478.2 | 208612.2 | 1504872 | 0 | 197381.9 | 125.9 | 6.65 | 2.18 |
Figure 29: Cast votes histogram and QQ Plot
Figure 30: Cast votes boxplot
Central Tendency: The summary statistics (Table 16) show that the median and mean cast votes were 76,506.8 votes and 156,740.5 votes, respectively.
Dispersion: The standard deviation, s = 197,381.94, corresponds with a coefficient of variation of 125.9%, indicating a very high degree of dispersion.
Shape of Distribution: The sample skewness (2.18) indicated that the distribution of cast votes was right-skewed. The sample kurtosis (6.65) indicated that the distribution of cast votes was leptokurtic or heavy-tailed. The histogram and QQ plot in Figure 29 reveal a right-skewed distribution that departs significantly from normality.
Outliers: The boxplot in Figure 30, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that outliers were extant. The 25%, 75%, and IQR were 17,866, 226,478.2, and 208,612.2, respectively. This yielded a 1.5xIQR ‘acceptable’ range [0, 539,396.5]. Indeed, this confirmed the existence of 28 outliers.
Log Cast Votes
This is a log transformation of the cast votes variable.
Table 17: Log cast votes summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 7.5 | 14.1 | 16.2 | 15.8 | 17.8 | 3.7 | 20.5 | 0 | 2.47 | 15.6 | -0.15 | -0.63 |
Figure 31: Log cast votes histogram and QQ Plot
Figure 32: Log cast votes boxplot
Central Tendency: The summary statistics (Table 17) show that the median and mean log cast votes were 16.2 log(votes) and 15.8 log(votes), respectively.
Dispersion: The standard deviation, s = 2.47, corresponds with a coefficient of variation of 15.6%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (-0.63), indicated that the distribution of log cast votes was left-skewed. The sample kurtosis (-0.15), indicated that the distribution of log cast votes was platykurtic or light-tailed. The histogram and QQ plot in Figure 31 reveals a left skewed distribution that approximates normality.
Outliers: The boxplot in Figure 32, which graphically depicts the median, the IQR, and maximum and minimum values, suggested that outliers were extant. The 25%, 75%, and IQR were 14.1, 17.8, and 3.7, respectively. This yielded a 1.5xIQR ‘acceptable’ range [8.55, 23.35]. Indeed, this confirmed the existence of 3 outliers.
Scores
This variable captured the total score for each film defined as 10 * IMDB Rating + critics score + audience_score.
Table 18: Scores summary statistics

| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 38 | 144.5 | 185 | 184.1 | 231.5 | 87 | 284 | 0 | 54.75 | 29.7 | -0.82 | -0.32 |
Figure 33: Scores histogram and QQ Plot
Figure 34: Scores boxplot
Central Tendency: Per the summary statistics (Table 18), the median and mean of total scores were 185 points and 184.1 points, respectively.
Dispersion: The standard deviation, s = 54.75, corresponds to a coefficient of variation of 29.7%, indicating a moderate degree of dispersion.
Shape of Distribution: The sample skewness (-0.32) indicated that the distribution of total scores was approximately symmetric. The sample kurtosis (-0.82) indicated that the distribution was platykurtic, or light-tailed. The histogram and QQ plot in Figure 33 reveal a slightly left-skewed distribution that approximates normality.
Outliers: The boxplot in Figure 34, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that no outliers were present. The first quartile, third quartile, and IQR were 144.5, 231.5, and 87, respectively, yielding a 1.5 × IQR ‘acceptable’ range of [14, 362]. This confirmed that there were no outliers.
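As a small illustration of the scores definition and of the coefficient of variation quoted in the Dispersion paragraph, the sketch below computes both. The function names are hypothetical, and the example film's ratings are made up purely to show the arithmetic.

```python
def total_score(imdb_rating, critics_score, audience_score):
    """Total score as defined above: 10 * IMDB rating + critics + audience."""
    return 10 * imdb_rating + critics_score + audience_score


def coefficient_of_variation(sd, mean):
    """Coefficient of variation expressed as a percentage of the mean."""
    return 100 * sd / mean


# Hypothetical film: IMDB rating 7.5, critics score 80, audience score 85.
example = total_score(7.5, 80, 85)

# SD and mean reported for the scores variable.
cv = coefficient_of_variation(54.75, 184.1)
print(example, round(cv, 1))
```

Because the IMDB rating is scaled by 10, each of the three components contributes on a roughly 0 to 100 scale, so the total can be at most 300.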
Log Scores
This is a log transformation of the scores variable.
Table 19: Log scores summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 5.2 | 7.2 | 7.5 | 7.4 | 7.9 | 0.7 | 8.1 | 0 | 0.5 | 6.7 | 1.12 | -1.07 |
Figure 35: Log scores histogram and QQ Plot
Figure 36: Log scores boxplot
Central Tendency: Per the summary statistics (Table 19), the median and mean of log total scores were 7.5 points and 7.4 points, respectively.
Dispersion: The standard deviation, s = 0.5, corresponds to a coefficient of variation of 6.7%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (-1.07) indicated that the distribution of log total scores was left-skewed. The sample kurtosis (1.12) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 35 reveal a left-skewed distribution that departs rather significantly from normality.
Outliers: The boxplot in Figure 36, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The first quartile, third quartile, and IQR were 7.2, 7.9, and 0.7, respectively, yielding a 1.5 × IQR ‘acceptable’ range of [6.15, 8.95]. This confirmed the existence of 14 outliers.
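The source does not state which logarithm base was used for the log-transformed variables. A quick check, sketched below, compares the reported raw minimum and maximum of scores (38 and 284) against the reported log minimum and maximum (5.2 and 8.1) under a few candidate bases. The values appear consistent with base 2, but this is an inference from the tables rather than something the report confirms.

```python
import math

# Raw minimum and maximum of the scores variable (Table 18).
raw_min, raw_max = 38, 284

# Candidate bases; which one the author used is not stated in the source.
candidates = {"log2": math.log2, "ln": math.log, "log10": math.log10}

results = {
    name: (round(fn(raw_min), 1), round(fn(raw_max), 1))
    for name, fn in candidates.items()
}
for name, (low, high) in results.items():
    print(f"{name}: min={low}, max={high}")
```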
IMDB Votes * Rating
This interaction variable is defined as the product of IMDB votes and IMDB ratings.
Table 20: IMDB Votes * Rating summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 1335.9 | 30708.1 | 99088.8 | 426584.5 | 391389.5 | 360681.5 | 7590568 | 0 | 910840.8 | 213.5 | 24.2 | 4.47 |
Figure 37: IMDB Votes * Rating histogram and QQ Plot
Figure 38: IMDB Votes * Rating boxplot
Central Tendency: Per the summary statistics (Table 20), the median and mean of votes * IMDB rating were 99,088.8 points and 426,584.5 points, respectively.
Dispersion: The standard deviation, s = 910,840.83, corresponds to a coefficient of variation of 213.5%, indicating a very high degree of dispersion.
Shape of Distribution: The sample skewness (4.47) indicated that the distribution of votes * IMDB rating was right-skewed. The sample kurtosis (24.2) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 37 reveal a right-skewed distribution that departs significantly from normality.
Outliers: The boxplot in Figure 38, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The first quartile, third quartile, and IQR were 30,708.1, 391,389.5, and 360,681.5, respectively, yielding a 1.5 × IQR ‘acceptable’ range of [0, 932,411.75], with the negative lower fence truncated at 0. This confirmed the existence of 73 outliers.
Log IMDB Votes * Rating
This is a log transformation of the IMDB Votes * Rating variable.
Table 21: Log IMDB Votes * Rating summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 10.4 | 14.9 | 16.6 | 16.8 | 18.6 | 3.7 | 22.9 | 0 | 2.44 | 14.6 | -0.56 | 0.13 |
Figure 39: Log IMDB Votes * Rating histogram and QQ Plot
Figure 40: Log IMDB Votes * Rating boxplot
Central Tendency: Per the summary statistics (Table 21), the median and mean of log(votes * IMDB rating) were 16.6 log points and 16.8 log points, respectively.
Dispersion: The standard deviation, s = 2.44, corresponds to a coefficient of variation of 14.6%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (0.13) indicated that the distribution of log(votes * IMDB rating) was approximately symmetric. The sample kurtosis (-0.56) indicated that the distribution was platykurtic, or light-tailed. The histogram and QQ plot in Figure 39 reveal an approximately symmetric distribution that approximates normality.
Outliers: The boxplot in Figure 40, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that no outliers were present. The first quartile, third quartile, and IQR were 14.9, 18.6, and 3.7, respectively, yielding a 1.5 × IQR ‘acceptable’ range of [9.35, 24.15]. This confirmed that there were no outliers.
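One property worth keeping in mind for the logged interaction variables is that the log of a product equals the sum of the logs, so log(votes * rating) carries the same information as log(votes) + log(rating). The sketch below illustrates this with a made-up film. Base 2 is assumed only for illustration (the identity holds for any base), and the function name is hypothetical.

```python
import math


def votes_times_rating(imdb_num_votes, imdb_rating):
    """Interaction variable: product of IMDB votes and IMDB rating."""
    return imdb_num_votes * imdb_rating


# Hypothetical film: 10,000 votes and a 7.2 rating.
votes, rating = 10_000, 7.2
product = votes_times_rating(votes, rating)

log_of_product = math.log2(product)
sum_of_logs = math.log2(votes) + math.log2(rating)
print(round(log_of_product, 4), round(sum_of_logs, 4))
```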
IMDB Votes * Critics Score
This interaction variable is defined as the product of IMDB votes and critics score.
Table 22: IMDB Votes * Critics Score summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 11310 | 196248 | 815572 | 4111799 | 3341443 | 3145195 | 78584704 | 0 | 9677896 | 235.4 | 23.77 | 4.51 |
Figure 41: IMDB Votes * Critics Score histogram and QQ Plot
Figure 42: IMDB Votes * Critics Score boxplot
Central Tendency: Per the summary statistics (Table 22), the median and mean of votes * critics score were 815,572 points and 4,111,799.3 points, respectively.
Dispersion: The standard deviation, s = 9,677,896.18, corresponds to a coefficient of variation of 235.4%, indicating a very high degree of dispersion.
Shape of Distribution: The sample skewness (4.51) indicated that the distribution of votes * critics score was right-skewed. The sample kurtosis (23.77) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 41 reveal a right-skewed distribution that departs significantly from normality.
Outliers: The boxplot in Figure 42, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The first quartile, third quartile, and IQR were 196,248, 3,341,442.5, and 3,145,194.5, respectively, yielding a 1.5 × IQR ‘acceptable’ range of [0, 8,059,234.25], with the negative lower fence truncated at 0. This confirmed the existence of 83 outliers.
Log IMDB Votes * Critics Score
This is a log transformation of the IMDB Votes * Critics Score variable.
Table 23: Log IMDB Votes * Critics Score summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 13.5 | 17.6 | 19.6 | 19.6 | 21.7 | 4.1 | 26.2 | 0 | 2.71 | 13.8 | -0.58 | 0.13 |
Figure 43: Log IMDB Votes * Critics Score histogram and QQ Plot
Figure 44: Log IMDB Votes * Critics Score boxplot
Central Tendency: Per the summary statistics (Table 23), the median and mean of log(votes * critics score) were both 19.6 log points.
Dispersion: The standard deviation, s = 2.71, corresponds to a coefficient of variation of 13.8%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (0.13) indicated that the distribution of log(votes * critics score) was approximately symmetric. The sample kurtosis (-0.58) indicated that the distribution was platykurtic, or light-tailed. The histogram and QQ plot in Figure 43 reveal an approximately symmetric distribution that approximates normality.
Outliers: The boxplot in Figure 44, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that no outliers were present. The first quartile, third quartile, and IQR were 17.6, 21.7, and 4.1, respectively, yielding a 1.5 × IQR ‘acceptable’ range of [11.45, 27.85]. This confirmed that there were no outliers.
IMDB Votes * Audience Score
This interaction variable is defined as the product of IMDB votes and audience score.
Table 24: IMDB Votes * Audience Score summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 9750 | 257713.5 | 859140 | 4378112 | 3664785 | 3407072 | 81263728 | 0 | 9805170 | 224 | 24.64 | 4.53 |
Figure 45: IMDB Votes * Audience Score histogram and QQ Plot
Figure 46: IMDB Votes * Audience Score boxplot
Central Tendency: Per the summary statistics (Table 24), the median and mean of votes * audience score were 859,140 points and 4,378,111.7 points, respectively.
Dispersion: The standard deviation, s = 9,805,170.02, corresponds to a coefficient of variation of 224%, indicating a very high degree of dispersion.
Shape of Distribution: The sample skewness (4.53) indicated that the distribution of votes * audience score was right-skewed. The sample kurtosis (24.64) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 45 reveal a right-skewed distribution that departs significantly from normality.
Outliers: The boxplot in Figure 46, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The first quartile, third quartile, and IQR were 257,713.5, 3,664,785, and 3,407,071.5, respectively, yielding a 1.5 × IQR ‘acceptable’ range of [0, 8,775,392.25], with the negative lower fence truncated at 0. This confirmed the existence of 80 outliers.
Log IMDB Votes * Audience Score
This is a log transformation of the IMDB Votes * Audience Score variable.
Table 25: Log IMDB Votes * Audience Score summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 13.3 | 18 | 19.7 | 20 | 21.8 | 3.8 | 26.3 | 0 | 2.54 | 12.7 | -0.58 | 0.16 |
Figure 47: Log IMDB Votes * Audience Score histogram and QQ Plot
Figure 48: Log IMDB Votes * Audience Score boxplot
Central Tendency: Per the summary statistics (Table 25), the median and mean of log(votes * audience score) were 19.7 log points and 20 log points, respectively.
Dispersion: The standard deviation, s = 2.54, corresponds to a coefficient of variation of 12.7%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (0.16) indicated that the distribution of log(votes * audience score) was approximately symmetric. The sample kurtosis (-0.58) indicated that the distribution was platykurtic, or light-tailed. The histogram and QQ plot in Figure 47 reveal an approximately symmetric distribution that approximates normality.
Outliers: The boxplot in Figure 48, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that no outliers were present. The first quartile, third quartile, and IQR were 18, 21.8, and 3.8, respectively, yielding a 1.5 × IQR ‘acceptable’ range of [12.3, 27.5]. This confirmed that there were no outliers.
IMDB Votes * Total Score
This interaction variable is defined as the product of IMDB votes and total score.
Table 26: IMDB Votes * Total Score summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 42822 | 829162 | 2699180 | 12755756 | 10554420 | 9725258 | 235754112 | 0 | 28491864 | 223.4 | 24.13 | 4.5 |
Figure 49: IMDB Votes * Total Score histogram and QQ Plot
Figure 50: IMDB Votes * Total Score boxplot
Central Tendency: Per the summary statistics (Table 26), the median and mean of votes * total score were 2,699,180 points and 12,755,756.4 points, respectively.
Dispersion: The standard deviation, s = 28,491,863.84, corresponds to a coefficient of variation of 223.4%, indicating a very high degree of dispersion.
Shape of Distribution: The sample skewness (4.5) indicated that the distribution of votes * total score was right-skewed. The sample kurtosis (24.13) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 49 reveal a right-skewed distribution that departs significantly from normality.
Outliers: The boxplot in Figure 50, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The first quartile, third quartile, and IQR were 829,162, 10,554,419.5, and 9,725,257.5, respectively, yielding a 1.5 × IQR ‘acceptable’ range of [0, 25,142,305.75], with the negative lower fence truncated at 0. This confirmed the existence of 81 outliers.
Log IMDB Votes * Total Score
This is a log transformation of the IMDB Votes * Total Score variable.
Table 27: Log IMDB Votes * Total Score summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627 | 15.4 | 19.7 | 21.4 | 21.5 | 23.3 | 3.7 | 27.8 | 0 | 2.5 | 11.6 | -0.56 | 0.18 |
Figure 51: Log IMDB Votes * Total Score histogram and QQ Plot
Figure 52: Log IMDB Votes * Total Score boxplot
Central Tendency: Per the summary statistics (Table 27), the median and mean of log(votes * total score) were 21.4 log points and 21.5 log points, respectively.
Dispersion: The standard deviation, s = 2.5, corresponds to a coefficient of variation of 11.6%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (0.18) indicated that the distribution of log(votes * total score) was approximately symmetric. The sample kurtosis (-0.56) indicated that the distribution was platykurtic, or light-tailed. The histogram and QQ plot in Figure 51 reveal an approximately symmetric distribution that approximates normality.
Outliers: The boxplot in Figure 52, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that no outliers were present. The first quartile, third quartile, and IQR were 19.7, 23.3, and 3.7, respectively, yielding a 1.5 × IQR ‘acceptable’ range of [14.15, 28.85]. This confirmed that there were no outliers.
Box Office
Total lifetime box office revenue was obtained for a subset of 100 randomly selected films from the movie data set. This is an analysis of box office revenue for this random sampling.
Table 28: Box office revenue summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 166057 | 3157679 | 15902968 | 48640856 | 60625820 | 57468141 | 658672302 | 0 | 87341393 | 179.6 | 24.65 | 4.37 |
Figure 53: Box office revenue histogram and QQ Plot
Figure 54: Box office revenue boxplot
Central Tendency: Per the summary statistics (Table 28), the median and mean of box office revenue were 15,902,968 dollars and 48,640,855.6 dollars, respectively.
Dispersion: The standard deviation, s = 87,341,393.06, corresponds to a coefficient of variation of 179.6%, indicating a very high degree of dispersion.
Shape of Distribution: The sample skewness (4.37) indicated that the distribution of box office revenue was right-skewed. The sample kurtosis (24.65) indicated that the distribution was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 53 reveal a right-skewed distribution that departs significantly from normality.
Outliers: The boxplot in Figure 54, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The first quartile, third quartile, and IQR were 3,157,678.8, 60,625,819.5, and 57,468,140.8, respectively, yielding a 1.5 × IQR ‘acceptable’ range of [0, 146,828,030.7], with the negative lower fence truncated at 0. This confirmed the existence of 7 outliers.
Log Box Office
This is a log transformation of the box office variable.
Table 29: Log box office revenue summary statistics
| N | Min | Q1 | Median | Mean | Q3 | IQR | Max | NA.s | SD | CV | Kurtosis | Skewness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 17.3 | 21.6 | 23.9 | 23.6 | 25.9 | 4.3 | 29.3 | 0 | 2.82 | 11.9 | -0.56 | -0.45 |
Figure 55: Log box office revenue histogram and QQ Plot
Figure 56: Log box office revenue boxplot
Central Tendency: Per the summary statistics (Table 29), the median and mean of log box office revenue were 23.9 log(dollars) and 23.6 log(dollars), respectively.
Dispersion: The standard deviation, s = 2.82, corresponds to a coefficient of variation of 11.9%, indicating a low degree of dispersion.
Shape of Distribution: The sample skewness (-0.45) indicated that the distribution of log box office revenue was approximately symmetric. The sample kurtosis (-0.56) indicated that the distribution was platykurtic, or light-tailed. The histogram and QQ plot in Figure 55 reveal a slightly left-skewed distribution that approximates normality.
Outliers: The boxplot in Figure 56, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that no outliers were present. The first quartile, third quartile, and IQR were 21.6, 25.9, and 4.3, respectively, yielding a 1.5 × IQR ‘acceptable’ range of [15.15, 32.35]. This confirmed that there were no outliers.
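For reference, the columns of the summary tables in this section can be computed along the following lines. This is a sketch under stated assumptions rather than the author's implementation: the function name `summarize` is hypothetical, the quartiles use one of several common interpolation conventions, and skewness and kurtosis use the simple moment-based formulas (with kurtosis reported as excess kurtosis). Other conventions give slightly different values. The input is toy data.

```python
import statistics


def summarize(values):
    """Return the descriptive statistics reported in the summary tables."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    # "inclusive" interpolates linearly between order statistics.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Central moments used by the moment-based shape measures.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "N": n,
        "Min": min(values),
        "Q1": q1,
        "Median": median,
        "Mean": mean,
        "Q3": q3,
        "IQR": q3 - q1,
        "Max": max(values),
        "SD": sd,
        "CV": 100 * sd / mean,          # percent of the mean
        "Kurtosis": m4 / m2 ** 2 - 3,   # excess kurtosis (assumption)
        "Skewness": m3 / m2 ** 1.5,
    }


# Toy data, for illustration only.
stats = summarize([1, 2, 3, 4, 5])
for name, value in stats.items():
    print(name, value)
```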
Bivariate Analysis
The objective at this stage is to ascertain the correlation (quantitative independent variable) or the association (categorical independent variable) between movie popularity and the following candidate predictors. To ascertain the suitability of a candidate predictor, statistical inference (i.e., hypothesis testing) was conducted to draw conclusions about how movie popularity relates to various factors, based on the sample of popularity and the explanatory variables. Once conditions were checked, the appropriate parametric (ANOVA / regression) or non-parametric (Mann–Whitney U / Kruskal–Wallis) tests were conducted. The confidence level for all tests was 95%, yielding a two-sided \(\alpha = 0.05\). Decisions regarding the relationship between movie popularity and each factor were based on the probability of observing a test statistic at least as extreme as the one observed, given that the null hypothesis (equal means / zero slope) was true.
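To make the parametric test for a categorical predictor concrete, the sketch below computes the one-way ANOVA F statistic by hand. The function name and the three groups of values are made up for illustration and are not drawn from the movie data. The p-value would then come from the F distribution with the returned degrees of freedom (typically via a statistics library, omitted here), and the null hypothesis of equal means would be rejected when that p-value falls below 0.05.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    # Between-group variation: group means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Within-group variation: observations around their own group mean.
    ss_within = sum(
        (x - sum(g) / len(g)) ** 2 for g in groups for x in g
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up popularity values for three hypothetical genres.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print(f_stat, df_b, df_w)
```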
Having introduced each of the variables and created new ones, twelve independent variables were selected for this next stage bivariate analysis and they are listed in Table 30.
Table 30: Candidate predictors
| Variable | Description |
|---|---|
| genre | Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other) |
| mpaa_rating | MPAA rating of the movie (G, PG, PG-13, R, Unrated) |
| studio | Studio that produced the movie |
| director | Director of the movie |
| thtr_rel_season | Season the movie was released in theaters |
| thtr_rel_month | Month the movie is released in theaters |
| studio_experience | Total number of films for studio in sample |
| director_experience | Total number of films in sample for a director |
| cast_experience | The sum across all cast members for a film, of the number of films in which each actor appeared |
| imdb_num_votes | Number of votes on IMDB |
| imdb_num_votes_log | Log number of IMDB votes |
| imdb_rating | Rating on IMDB |
| critics_score | Critics score on Rotten Tomatoes |
| audience_score | Audience score on Rotten Tomatoes |
| best_pic_nom | Whether or not the movie was nominated for a best picture Oscar (no, yes) |
| best_pic_win | Whether or not the movie won a best picture Oscar (no, yes) |
| best_actor_win | Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie |
| best_actress_win | Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actress won an Oscar for her role in the given movie |
| best_dir_win | Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie |
| top200_box | Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes) |
| studio_votes | Total IMDB votes for the studio for each film |
| studio_votes_log | Log total IMDB votes for the studio for each film |
| cast_votes | Total number of allocated IMDB votes for the cast of a film |
| cast_votes_log | Log of the total number of allocated IMDB votes for the cast of a film |
| scores | 10 * IMDB Rating + critics score + audience_score |
| scores_log | Log(10 * IMDB Rating + critics score + audience_score) |
| votes_imdb_rating | Votes * imdb_rating |
| votes_imdb_rating_log | log(votes * imdb_rating) |
| votes_critics_score | votes * critics_score |
| votes_critics_score_log | log(votes * critics_score) |
| votes_audience_score | votes * audience_score |
| votes_audience_score_log | log(votes * audience_score) |
| votes_scores | votes * scores |
| votes_scores_log | log(votes * scores) |
| box_office | Box office revenue from BoxOfficeMojo.com |
| box_office_log | Log of Box office revenue from BoxOfficeMojo.com |
Certain variables, such as website addresses, film titles, and runtimes, provided no predictive value for popularity. Similarly, the studio, director, and actor variables were excluded in favor of their popularity and experience measures. The day of theatrical release and the DVD release dates were not of interest for this analysis. Categorical scoring variables were excluded in favor of numeric measures. Lastly, dichotomous variables such as the Oscar wins and inclusion in the box office top 200 provided insufficient sample size for one or more of their levels and were therefore excluded.
Genre
The hypothesis for the association between genre and movie popularity was as follows:
\(H_0\):
Part 4: Modeling
To ascertain the suitability of a candidate predictor, statistical inference (i.e., hypothesis testing) was conducted to draw conclusions about how movie popularity relates to various factors, based on the sample of popularity and the explanatory variables.
The relationship between movie popularity and an explanatory variable can be described by the equation \(Y = \beta_0 + \beta_1 x\), where:
\(Y\) is the movie popularity score
\(\beta_0\) is the \(y\)-intercept of the regression line
\(\beta_1\) is the slope of the regression line
\(x\) is the coded value for the title type
The following analysis is concerned only with the statistical significance of the slope, \(\beta_1\), where \(\beta_1 \neq 0\) indicates that the explanatory variable \(x\) can be used to predict \(Y\), movie popularity.
Before making any inferences, the conditions for inference were checked. For categorical variables, linearity, independence of errors, normality of errors, and equal error variance were checked. Next, the hypotheses \(H_0\): \(\beta_1 = 0\) and \(H_a\): \(\beta_1 \neq 0\) were tested. The confidence level for all tests was 95%, with a two-tailed \(\alpha = 0.05\). Two test statistics were used: (1) the \(t\)-statistic and (2) the \(F\)-statistic for analysis of variance.
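The slope test described above can be sketched with ordinary least squares for a single predictor. The code below estimates the intercept and slope and forms the t-statistic for the null hypothesis that the slope is zero. The function name and the data are hypothetical and chosen only to keep the arithmetic easy to follow. The p-value would come from a t distribution with n - 2 degrees of freedom, which is left to a statistics library.

```python
import math


def slope_t_statistic(x, y):
    """Fit y = b0 + b1 * x by least squares; return (b0, b1, t for b1 = 0)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # Residual sum of squares and the standard error of the slope.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(sse / (n - 2) / s_xx)
    return b0, b1, b1 / se_b1


# Made-up predictor and response values, for illustration only.
b0, b1, t_stat = slope_t_statistic([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(b0, 3), round(b1, 3), round(t_stat, 3))
```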
Observations included/omitted - title_type == TV removed
Table 31: Forward Selection Prediction Model
Table 32: Backward Elimination Prediction Model
Part 5: Prediction
Part 6: Conclusion
Appendix
Appendix A: Codebook
Table 33: Movie data set codebook
| Source | Type | Variable | Description |
|---|---|---|---|
| General | |||
| IMDB/RT/BO | Categorical | title | Title of movie |
| IMDB/RT/BO | Categorical | title_type | Type of movie (Documentary, Feature Film, TV Movie) |
| IMDB/RT/BO | Categorical | genre | Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other) |
| IMDB/RT/BO | Numeric | runtime | Runtime of movie (in minutes) |
| IMDB/RT/BO | Categorical | mpaa_rating | MPAA rating of the movie (G, PG, PG-13, R, Unrated) |
| IMDB/RT/BO | Categorical | imdb_url | Link to IMDB page for the movie |
| IMDB/RT/BO | Categorical | rt_url | Link to Rotten Tomatoes page for the movie |
| Organization | |||
| IMDB/RT/BO | Categorical | studio | Studio that produced the movie |
| IMDB/RT/BO | Categorical | director | Director of the movie |
| IMDB/RT/BO | Categorical | actor1 | First main actor/actress in the abridged cast of the movie |
| IMDB/RT/BO | Categorical | actor2 | Second main actor/actress in the abridged cast of the movie |
| IMDB/RT/BO | Categorical | actor3 | Third main actor/actress in the abridged cast of the movie |
| IMDB/RT/BO | Categorical | actor4 | Fourth main actor/actress in the abridged cast of the movie |
| IMDB/RT/BO | Categorical | actor5 | Fifth main actor/actress in the abridged cast of the movie |
| Dates | |||
| IMDB/RT/BO | Categorical | thtr_rel_year | Year the movie is released in theaters |
| Derived | Categorical | thtr_rel_season | Season the movie was released in theaters |
| IMDB/RT/BO | Categorical | thtr_rel_month | Month the movie is released in theaters |
| IMDB/RT/BO | Categorical | thtr_rel_day | Day of the month the movie is released in theaters |
| IMDB/RT/BO | Categorical | dvd_rel_year | Year the movie is released on DVD |
| IMDB/RT/BO | Categorical | dvd_rel_month | Month the movie is released on DVD |
| IMDB/RT/BO | Categorical | dvd_rel_day | Day of the month the movie is released on DVD |
| Experience | |||
| Derived | Numeric | studio_experience | Total number of films for studio in sample |
| Derived | Numeric | director_experience | Total number of films in sample for a director |
| Derived | Numeric | cast_experience | The sum across all cast members for a film, of the number of films in which each actor appeared |
| Performance | |||
| IMDB/RT/BO | Numeric | imdb_num_votes | Number of votes on IMDB |
| Derived | Numeric | imdb_num_votes_log | Log number of IMDB votes |
| IMDB/RT/BO | Numeric | imdb_rating | Rating on IMDB |
| IMDB/RT/BO | Categorical | critics_rating | Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten) |
| IMDB/RT/BO | Categorical | audience_rating | Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright) |
| IMDB/RT/BO | Numeric | critics_score | Critics score on Rotten Tomatoes |
| IMDB/RT/BO | Numeric | audience_score | Audience score on Rotten Tomatoes |
| IMDB/RT/BO | Categorical | best_pic_nom | Whether or not the movie was nominated for a best picture Oscar (no, yes) |
| IMDB/RT/BO | Categorical | best_pic_win | Whether or not the movie won a best picture Oscar (no, yes) |
| IMDB/RT/BO | Categorical | best_actor_win | Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie |
| IMDB/RT/BO | Categorical | best_actress_win | Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actress won an Oscar for her role in the given movie |
| IMDB/RT/BO | Categorical | best_dir_win | Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie |
| IMDB/RT/BO | Categorical | top200_box | Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes) |
| Derived | Numeric | studio_votes | Total IMDB votes for the studio for each film |
| Derived | Numeric | studio_votes_log | Log total IMDB votes for the studio for each film |
| Derived | Numeric | cast_votes | Total number of allocated IMDB votes for the cast of a film |
| Derived | Numeric | cast_votes_log | Log of the total number of allocated IMDB votes for the cast of a film |
| Interaction | |||
| Derived | Numeric | scores | 10 * IMDB Rating + critics score + audience_score |
| Derived | Numeric | scores_log | Log(10 * IMDB Rating + critics score + audience_score) |
| Derived | Numeric | votes_imdb_rating | Votes * imdb_rating |
| Derived | Numeric | votes_imdb_rating_log | log(votes * imdb_rating) |
| Derived | Numeric | votes_critics_score | votes * critics_score |
| Derived | Numeric | votes_critics_score_log | log(votes * critics_score) |
| Derived | Numeric | votes_audience_score | votes * audience_score |
| Derived | Numeric | votes_audience_score_log | log(votes * audience_score) |
| Derived | Numeric | votes_scores | votes * scores |
| Derived | Numeric | votes_scores_log | log(votes * scores) |
| Box Office | |||
| IMDB/RT/BO | Numeric | box_office | Box office revenue from BoxOfficeMojo.com |
| IMDB/RT/BO | Numeric | box_office_log | Log of Box office revenue from BoxOfficeMojo.com |